----------------------------- AIRBNB CHALLENGE -----------------------------

Where will new guests book their first travel experience?

Malo Grisard, Guillaume Jaume, Cyril Pecoraro - EPFL - 15th of January 2017

Main Preprocessing:

Pipeline:

1. Data exploration and cleaning
2. Machine learning preprocessing
3. Machine learning optimization
4. Results

The purpose of this project is to predict in which country a new user will make their first booking. We are given a list of users along with their demographics, web session records, and some summary statistics. All the users in this dataset are from the USA.

There are 12 possible values for the destination country: 'US', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL', 'DE', 'AU', 'NDF' (no destination found), and 'other'.

In this notebook, we explore and clean the given data and highlight the most relevant extracted features.

The cleaned data are saved into separate files and analysed in the Machine Learning notebook.

Important Remark: Due to size constraints, we were not able to include the file sessions.csv in the Git repository; you can download it directly from the Kaggle competition here.


In [1]:
import pandas as pd
import os
import preprocessing_helper
import matplotlib.pyplot as plt
%matplotlib inline

1. Data exploration and cleaning

The dataset is composed of several files. First, we are going to explore each of them and clean some variables. For a complete explanation of each file, please see the file DATA.md.

1.1 file 'train_users_2.csv'

This file is the most important one in our dataset, as it contains the users, information about them, and their country of destination.

When a user has booked a trip through Airbnb, the destination country is specified. Otherwise, 'NDF' is indicated.


In [2]:
filename = "train_users_2.csv"
folder = 'data'
fileAdress = os.path.join(folder, filename)
df = pd.read_csv(fileAdress)
df.head()


Out[2]:
id date_account_created timestamp_first_active date_first_booking gender age signup_method signup_flow language affiliate_channel affiliate_provider first_affiliate_tracked signup_app first_device_type first_browser country_destination
0 gxn3p5htnn 2010-06-28 20090319043255 NaN -unknown- NaN facebook 0 en direct direct untracked Web Mac Desktop Chrome NDF
1 820tgsjxq7 2011-05-25 20090523174809 NaN MALE 38.0 facebook 0 en seo google untracked Web Mac Desktop Chrome NDF
2 4ft3gnwmtx 2010-09-28 20090609231247 2010-08-02 FEMALE 56.0 basic 3 en direct direct untracked Web Windows Desktop IE US
3 bjjt8pjhuk 2011-12-05 20091031060129 2012-09-08 FEMALE 42.0 facebook 0 en direct direct untracked Web Mac Desktop Firefox other
4 87mebub9p4 2010-09-14 20091208061105 2010-02-18 -unknown- 41.0 basic 0 en direct direct untracked Web Mac Desktop Chrome US

There are missing values in the following columns:

  • date_first_booking : users who never booked an Airbnb apartment
  • gender : users who did not wish to specify their gender
  • age : users who did not wish to specify their age
  • first_affiliate_tracked : missing data

We will go through each of these variables and decide how to handle the missing values.


In [3]:
df.isnull().any()


Out[3]:
id                         False
date_account_created       False
timestamp_first_active     False
date_first_booking          True
gender                     False
age                         True
signup_method              False
signup_flow                False
language                   False
affiliate_channel          False
affiliate_provider         False
first_affiliate_tracked     True
signup_app                 False
first_device_type          False
first_browser              False
country_destination        False
dtype: bool

Ages

There are two problems regarding ages in the dataset.

First, many users did not specify an age. Second, some users entered their year of birth instead of their age.

To keep the data relevant, we only keep ages that were specified and lie between 15 and 100 years. For all the others, we naively assign a value of -1.
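
For reference, a minimal sketch of this rule, assuming the column is named age as in the table above (the actual implementation used here is preprocessing_helper.cleanAge):

# Hypothetical sketch: keep ages in [15, 100], assign -1 to everything else
# (years of birth such as 1950 fall outside the range and are caught as well)
def clean_age_sketch(df):
    age = df['age']
    invalid = age.isnull() | (age < 15) | (age > 100)
    df.loc[invalid, 'age'] = -1
    return df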


In [4]:
df = preprocessing_helper.cleanAge(df,'k')


Percentage of users with irrelevant age 1.11 %
Percentage of users with NaN age 41.22 %
All the invalid or missing age were replaced by value -1

The following graph presents the distribution of ages in the dataset. The irrelevant or missing ages also appear here, with their assigned value of -1.


In [5]:
preprocessing_helper.plotAge(df)


Gender

The following graph shows the gender distribution of Airbnb users. Note that around 45% did not specify their gender.
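
As a rough check of that figure, a hedged sketch, assuming the placeholder '-unknown-' marks unspecified genders, as in the table above:

# Hypothetical check: share of users with an unspecified gender
unknown_share = 100 * (df['gender'] == '-unknown-').mean()
print('Users without a specified gender: {:.1f} %'.format(unknown_share))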


In [6]:
df = preprocessing_helper.cleanGender(df)
preprocessing_helper.plotGender(df)


first_affiliate_tracked feature

This feature records the first marketing channel the user interacted with before signing up. When it is not specified, we set it to 'untracked'.
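
A minimal sketch of this rule, assuming the helper simply fills missing values with the label already used in the column (the actual implementation is preprocessing_helper.cleanFirst_affiliate_tracked):

# Hypothetical sketch: label missing first_affiliate_tracked values as 'untracked'
df['first_affiliate_tracked'] = df['first_affiliate_tracked'].fillna('untracked')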


In [7]:
df = preprocessing_helper.cleanFirst_affiliate_tracked(df)

Date_first_booking

  • This variable closely follows the dates on which accounts were created. With the strong growth of Airbnb bookings over the years, the differences between months also become larger from year to year, as each yearly curve rises more steeply.
  • By studying each year independently, we can see that four peaks arise each month, corresponding to a certain day of the week. The following plots show the distribution of bookings over the months and then over the days of the week (a short sketch of the date handling follows this list).
  • The busiest day for Airbnb counted 248 bookings.
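
A hedged sketch of the date handling behind these plots, assuming date_first_booking is parsed as a datetime (the figures themselves are produced by the preprocessing_helper plotting functions):

# Hypothetical sketch: monthly and weekday counts of first bookings
dates = pd.to_datetime(df['date_first_booking'], errors='coerce').dropna()
bookings_per_month = dates.dt.month.value_counts().sort_index()
bookings_per_weekday = dates.dt.dayofweek.value_counts().sort_index()  # 0 = Monday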

In [8]:
df = preprocessing_helper.cleanDate_First_booking(df)
preprocessing_helper.plotDate_First_booking_years(df)


This histogram shows that bookings are fairly well spread over the year. Far fewer bookings are made in November and December, while May and June are the months in which users book the most. For these two months, Airbnb counts more than 20,000 bookings, which corresponds to almost a quarter of the bookings in our dataset.


In [9]:
preprocessing_helper.plotDate_First_booking_months(df)


As with the days on which most accounts are created, Tuesdays and Wednesdays are the days on which people book the most apartments on Airbnb.


In [10]:
preprocessing_helper.plotDate_First_booking_weekdays(df)


Save cleaned and explored file


In [11]:
filename = "cleaned_train_user.csv"
folder = 'cleaned_data'
fileAdress = os.path.join(folder, filename)
preprocessing_helper.saveFile(df, fileAdress)


file saved

1.2 file 'test_user.csv'

This file has a similar structure to train_users_2.csv, so we simply apply the same cleaning process here.


In [12]:
# extract file 
filename = "test_users.csv"
folder = 'data'
fileAdress = os.path.join(folder, filename)
df = pd.read_csv(fileAdress)
# process file
df = preprocessing_helper.cleanAge(df,'k')
df = preprocessing_helper.cleanGender(df)
df = preprocessing_helper.cleanFirst_affiliate_tracked(df)
# save file 
filename = "cleaned_test_user.csv"
folder = 'cleaned_data'
fileAdress = os.path.join(folder, filename)
preprocessing_helper.saveFile(df, fileAdress)


Percentage of users with irrelevant age 0.49 %
Percentage of users with NaN age 46.5 %
All the invalid or missing age were replaced by value -1
file saved

1.3 file 'countries.csv'

This file presents a summary of the destination countries in the dataset. The country codes stand for:

  • 'AU' = Australia
  • 'ES' = Spain
  • 'PT' = Portugal
  • 'US' = USA
  • 'FR' = France
  • 'CA' = Canada
  • 'GB' = Great Britain
  • 'IT' = Italy
  • 'NL' = Netherlands
  • 'DE' = Germany
  • 'NDF'= No destination found

All the variables are calculated with respect to the US and English. The Levenshtein distance indicates how far the language spoken in the destination country is from English. All the other variables are general geographic attributes. This file will not be used in our model, as it does not give direct information about the users.
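
For illustration only, a minimal pure-Python edit-distance function; note that the values in countries.csv appear to be normalised, percentage-like distances rather than raw character counts:

# Hypothetical illustration of the Levenshtein (edit) distance between two words
def levenshtein(a, b):
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]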


In [13]:
filename = "countries.csv"
folder = 'data'
fileAdress = os.path.join(folder, filename)
df = pd.read_csv(fileAdress)
df


Out[13]:
country_destination lat_destination lng_destination distance_km destination_km2 destination_language language_levenshtein_distance
0 AU -26.853388 133.275160 15297.7440 7741220.0 eng 0.00
1 CA 62.393303 -96.818146 2828.1333 9984670.0 eng 0.00
2 DE 51.165707 10.452764 7879.5680 357022.0 deu 72.61
3 ES 39.896027 -2.487694 7730.7240 505370.0 spa 92.25
4 FR 46.232193 2.209667 7682.9450 643801.0 fra 92.06
5 GB 54.633220 -3.432277 6883.6590 243610.0 eng 0.00
6 IT 41.873990 12.564167 8636.6310 301340.0 ita 89.40
7 NL 52.133057 5.295250 7524.3203 41543.0 nld 63.22
8 PT 39.553444 -7.839319 7355.2534 92090.0 por 95.45
9 US 36.966427 -95.844030 0.0000 9826675.0 eng 0.00

In [14]:
df.describe()


Out[14]:
lat_destination lng_destination distance_km destination_km2 language_levenshtein_distance
count 10.000000 10.000000 10.000000 1.000000e+01 10.00000
mean 39.799398 -4.262446 7181.897800 2.973734e+06 50.49900
std 24.739112 63.700585 3941.141734 4.329619e+06 44.54169
min -26.853388 -96.818146 0.000000 4.154300e+04 0.00000
25% 39.639090 -6.737559 7001.557600 2.580425e+05 0.00000
50% 44.053092 -0.139014 7603.632650 4.311960e+05 67.91500
75% 51.891219 9.163386 7842.357000 5.966865e+06 91.39500
max 62.393303 133.275160 15297.744000 9.984670e+06 95.45000

1.4 file 'age_gender_bkts.csv'

This file presents demographic statistics about each destination country in our dataset. This file will not be used in our model.


In [15]:
filename = "age_gender_bkts.csv"
folder = 'data'
fileAdress = os.path.join(folder, filename)
df = pd.read_csv(fileAdress)
df.head()


Out[15]:
age_bucket country_destination gender population_in_thousands year
0 100+ AU male 1.0 2015.0
1 95-99 AU male 9.0 2015.0
2 90-94 AU male 47.0 2015.0
3 85-89 AU male 118.0 2015.0
4 80-84 AU male 199.0 2015.0

Population total per country

The following table shows the total population of each country in 2015 (in thousands). These figures are consistent with publicly available data; the year column, being summed over all age buckets, carries no meaning here.


In [16]:
df_country = df.groupby(['country_destination'],as_index=False).sum()
df_country


Out[16]:
country_destination population_in_thousands year
0 AU 23923.0 84630.0
1 CA 35871.0 84630.0
2 DE 82562.0 84630.0
3 ES 47203.0 84630.0
4 FR 64983.0 84630.0
5 GB 63840.0 84630.0
6 IT 61145.0 84630.0
7 NL 16848.0 84630.0
8 PT 10609.0 84630.0
9 US 325132.0 84630.0

1.5 file 'sessions.csv'

This file keeps track of each action made by each user (identified by their id). For each action (lookup, search, etc.), the device type is recorded, as well as the time spent on the action.


In [20]:
filename = "sessions.csv"
folder = 'data'
fileAdress = os.path.join(folder, filename)
df = pd.read_csv(fileAdress)
df.head()


Out[20]:
user_id action action_type action_detail device_type secs_elapsed
0 d1mm9tcy42 lookup NaN NaN Windows Desktop 319.0
1 d1mm9tcy42 search_results click view_search_results Windows Desktop 67753.0
2 d1mm9tcy42 lookup NaN NaN Windows Desktop 301.0
3 d1mm9tcy42 search_results click view_search_results Windows Desktop 22141.0
4 d1mm9tcy42 lookup NaN NaN Windows Desktop 435.0

NaN users

As we can see, some rows are missing a user_id. Without a user_id, it is impossible to link them with the file train_users_2.csv. We delete these rows, as we cannot do anything with them.


In [21]:
df.isnull().any()


Out[21]:
user_id           True
action            True
action_type       True
action_detail     True
device_type      False
secs_elapsed      True
dtype: bool

In [22]:
df = preprocessing_helper.cleanSubset(df, 'user_id')


0.33 % have been removed from the original dataframe
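
A hedged sketch of what this filtering amounts to (preprocessing_helper.cleanSubset also reports the fraction of rows removed):

# Hypothetical sketch: drop session rows without a user_id
n_before = len(df)
df = df.dropna(subset=['user_id'])
print('{:.2f} % have been removed from the original dataframe'
      .format(100 * (n_before - len(df)) / n_before))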

Invalid session time

If a session time is NaN, there was probably an error during the session. We are not going to remove the corresponding rows, because they still contain interesting data for the action variables.

Instead, we are naively going to assign them a value of -1.


In [23]:
df['secs_elapsed'].fillna(-1, inplace = True)

Actions

Some actions produce -unknown- for action_type and/or action_detail, while others produce NaN. We replace the NaN values with -unknown- for action_type, action_detail, and action.
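
A minimal sketch of this replacement, assuming only these three columns need filling (the actual implementation is preprocessing_helper.cleanAction):

# Hypothetical sketch: align NaN values with the existing '-unknown-' placeholder
for col in ['action', 'action_type', 'action_detail']:
    df[col] = df[col].fillna('-unknown-')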


In [24]:
df = preprocessing_helper.cleanAction(df)


0.0 % have been removed from the original dataframe

As shown in the following, there are no more NaN values.


In [25]:
df.isnull().any()


Out[25]:
user_id          False
action           False
action_type      False
action_detail    False
device_type      False
secs_elapsed     False
dtype: bool

Total number of actions per user

From the sessions, we can compute the total number of actions per user. Intuitively, a user with only a few actions might be a user who does not book in the end. This value will be used as a new feature for the machine learning model.

Note: the total number of actions is plotted on a logarithmic scale.
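
A hedged sketch of how such a feature can be derived, assuming one session row per action (the project's version is preprocessing_helper.createActionFeature):

# Hypothetical sketch: count the number of session rows (actions) per user
actions_per_user = df.groupby('user_id').size().reset_index(name='total_actions')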


In [26]:
# Get total number of action per user_id
data_session_number_action = preprocessing_helper.createActionFeature(df)

# Save to .csv file
filename = "total_action_user_id.csv"
folder = 'cleaned_data'
fileAdress = os.path.join(folder, filename)
preprocessing_helper.saveFile(data_session_number_action, fileAdress)

# Plot distribution total number of action per user_id
preprocessing_helper.plotActionFeature(data_session_number_action)


file saved

Device types

There are 14 possible device types. Most of the users, however, are concentrated on three main devices.


In [27]:
preprocessing_helper.plotHist(df['device_type'])


Time spent on average per user

The figure below shows the average time spent per user. This plot relates to the total-number-of-actions plot above, with two even clearer Gaussian modes. We display only times > 20 s.

This value will also be used as a feature for the machine learning.
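
A hedged sketch of this aggregation, assuming secs_elapsed holds the per-action duration and ignoring the -1 placeholder; the total-time feature of the next subsection is obtained the same way with .sum() (the project's versions are createAverageTimeFeature and createTotalTimeFeature):

# Hypothetical sketch: average seconds elapsed per user, excluding the -1 placeholder
valid = df[df['secs_elapsed'] >= 0]
mean_time_per_user = valid.groupby('user_id')['secs_elapsed'].mean().reset_index()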


In [28]:
# Get Time spent on average per user_id
data_time_mean = preprocessing_helper.createAverageTimeFeature(df)

# Save to .csv file
data_time_mean = data_time_mean.rename(columns={'user_id': 'id'})
filename = "time_mean_user_id.csv"
folder = 'cleaned_data'
fileAdress = os.path.join(folder, filename)
preprocessing_helper.saveFile(data_time_mean, fileAdress)

# Plot distribution average time of session per user_id
preprocessing_helper.plotTimeFeature(data_time_mean['secs_elapsed'],'mean')


file saved

Time spent in total per user

The figure below shows the total amount of time spent per user. We display only times > 20 s.

This is the third and last feature extracted from the sessions file for the machine learning model. Intuitively, a long time spent suggests a booking, and possibly a more distant destination.


In [29]:
# Get Time spent in total per user_id
data_time_total = preprocessing_helper.createTotalTimeFeature(df)

# Save to .csv file
data_time_total = data_time_total.rename(columns={'user_id': 'id'})
filename = "time_total_user_id.csv"
folder = 'cleaned_data'
fileAdress = os.path.join(folder, filename)
preprocessing_helper.saveFile(data_time_total, fileAdress)

# Plot distribution total time of session per user_id
preprocessing_helper.plotTimeFeature(data_time_total['secs_elapsed'],'total')


file saved

Distribution of time spent

This last graph shows the distribution of time spent (in seconds) per session. We display only times > 20 s.


In [30]:
preprocessing_helper.plotTimeFeature(df['secs_elapsed'],'dist')


Conclusion on the preprocessing

Throughout this notebook, we explored all the files in the dataset and displayed the most relevant statistics. From the sessions file, we constructed features to enrich the train_users_2 data.

Starting from the cleaned data generated here, we are now able to design a machine learning model. This problem is addressed in the second notebook, Machine Learning.